Sample data may be usable directly, but you will at least need to research how the data was collected and determine its uncertainty.
For all sample data, research the following questions:
For all types of values (spatial, temporal, and measurement), gross errors are possible. These typically come from transcription errors, where users entered values into the wrong field or mistyped a digit, say an "8" for a "3", without realizing it.
- Extended digits
- Gridding such as 0.25, 0.5, 0.75
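The two patterns above can be screened for with a few lines of Python. This is a minimal sketch, not a definitive test: it assumes the values are available as text (for example, read from a CSV file), and the function names `decimal_places`, `fraction_counts`, and `gridded_share` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter
from decimal import Decimal


def decimal_places(text):
    """Number of digits after the decimal point, as written in the text."""
    exponent = Decimal(text).as_tuple().exponent
    return max(0, -exponent)


def fraction_counts(values):
    """Count how often each fractional part (e.g. 0.25) occurs."""
    counts = Counter()
    for text in values:
        fraction = (abs(Decimal(text)) % 1).normalize()
        counts[fraction] += 1
    return counts


def gridded_share(values, top=4):
    """Share of values whose fractional part is one of the `top` most common."""
    counts = fraction_counts(values)
    total = sum(counts.values())
    common = sum(n for _, n in counts.most_common(top))
    return common / total
```

A share close to 1.0 suggests the values were snapped to a grid such as 0.25, 0.5, 0.75, and a large `decimal_places` result suggests extended digits that imply more precision than was actually measured. The thresholds to act on are a judgment call for each data set.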
Every sample data set should be inspected visually for unexpected bias and patterning. Look for non-random and uniform distributions of points.
In some cases, bias can be detected by examining the "distance to roads" or other features that affected sample data collection. If this is the case, and there are a lot of points, it may be possible to remove the bias by subtracting a trend (e.g. distance to roads) from the sample data.
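One possible reading of "subtracting a trend" is to fit a straight line between the measured value and the distance to roads, then remove the distance-dependent part. The sketch below assumes a linear trend and that a distance to the nearest road has already been computed for each sample; `detrend` is a hypothetical helper, not a method prescribed here.

```python
def detrend(values, distances):
    """Remove a linear trend of `values` against `distances`.

    Returns the adjusted values and the fitted slope. The overall mean
    of the values is preserved; only the distance-dependent part is removed.
    """
    n = len(values)
    mean_d = sum(distances) / n
    mean_v = sum(values) / n
    # Ordinary least-squares slope of value against distance.
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (v - mean_v) for d, v in zip(distances, values))
    slope = sxy / sxx
    adjusted = [v - slope * (d - mean_d) for v, d in zip(values, distances)]
    return adjusted, slope
```

A slope near zero suggests little road-related trend in the measurements. With few points the fitted slope is unreliable, which is why this only makes sense when there are a lot of samples.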
- Number in polygons of states, counties, etc.
Gridding can often be detected by visual inspection. In some data sets, a nearest neighbor analysis will also reveal gridding, because the distances between the points cluster around a few values rather than varying randomly.
- Non-decimal values. These include blanks and text such as names, dates, and DMS values
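A nearest neighbor check can be sketched in plain Python as follows. It assumes the points are in projected coordinates (so straight-line distance is meaningful) and that the data set is small enough for a brute-force comparison of every pair; the function names are hypothetical.

```python
import math


def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point (brute force)."""
    result = []
    for i, (x1, y1) in enumerate(points):
        nearest = min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if j != i
        )
        result.append(nearest)
    return result


def distance_variation(points):
    """Coefficient of variation of the nearest neighbor distances.

    Values near 0 mean the distances are almost all the same, which is
    what a regular grid produces. Randomly placed points vary far more.
    """
    distances = nearest_neighbor_distances(points)
    mean = sum(distances) / len(distances)
    variance = sum((d - mean) ** 2 for d in distances) / len(distances)
    return math.sqrt(variance) / mean
```

For example, `distance_variation([(i * 0.25, j * 0.25) for i in range(5) for j in range(5)])` is 0 because every point on that grid has a neighbor exactly 0.25 away. Duplicate points give a nearest distance of 0 and should be removed or reviewed first.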
- 0 values:
- Values swapped between lat and lon, or with swapped signs
Dates may or may not be used. If they are, check that the predictors fall within a reasonable range of time of the samples (e.g. samples from the 1800s may not match a recent climate layer).
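The checks in this list can be combined into a single screening function. The sketch below assumes coordinates arrive as text in decimal degrees; `check_coordinate` and its optional `bounds` argument (a hypothetical study-area box given as min lat, max lat, min lon, max lon) are illustrative names. A swapped sign cannot be detected from the value alone, so the study-area box is what catches it here.

```python
import math


def check_coordinate(lat_text, lon_text, bounds=None):
    """Return a list of suspected problems for one lat/lon pair."""
    try:
        lat = float(lat_text)
        lon = float(lon_text)
    except (TypeError, ValueError):
        # Blanks, names, dates, and DMS strings all end up here.
        return ["non-decimal value"]
    if not (math.isfinite(lat) and math.isfinite(lon)):
        return ["non-decimal value"]

    problems = []
    if lat == 0 or lon == 0:
        problems.append("zero value")
    if abs(lat) > 90 and abs(lon) <= 90:
        problems.append("lat/lon possibly swapped")
    elif abs(lat) > 90 or abs(lon) > 180:
        problems.append("out of range")
    elif bounds is not None:
        min_lat, max_lat, min_lon, max_lon = bounds
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            problems.append("outside study area")
    return problems
```

An empty list means no problem was detected, not that the value is correct. A latitude and longitude that are both under 90 can be swapped without this function noticing unless `bounds` is supplied.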
Note that databases will "extend" date values. If you enter a year into a database field of type date, the database will extend the date to January 1st, midnight, of the specified year.
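Python's standard `datetime` module shows the same kind of extension and can stand in for a database here. In the sketch below, `parse_year_only` and `looks_extended` are hypothetical helpers; the second one flags dates that fall exactly on January 1st at midnight, which may have started life as a bare year.

```python
from datetime import datetime


def parse_year_only(text):
    """Parse a bare year; the missing parts default to January 1st, 00:00."""
    return datetime.strptime(text, "%Y")


def looks_extended(value):
    """True if a datetime is exactly January 1st at midnight."""
    return (
        value.month == 1
        and value.day == 1
        and value.hour == 0
        and value.minute == 0
        and value.second == 0
        and value.microsecond == 0
    )
```

Here `parse_year_only("1895")` gives `datetime(1895, 1, 1, 0, 0)`. A genuine sample could also have been taken on January 1st, so a flagged date is a prompt to check the original record rather than proof of an error.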
Now is the time to decide if the sample data might work for your particular problem.
© Copyright 2018 HSU - All rights reserved.